Abstract

The evolution of protein-protein interactions over time has led to a complex
network whose character is modular in cellular function and highly
correlated in its connectivity. Characterizing this modularity and explaining
how it emerges from evolutionary principles remains an important
challenge, as there is no encompassing theory to explain the resulting
modular topology. Here, we perform an empirical study of the yeast
protein-interaction network. We find a novel large-scale modular organization of the
functional classes of proteins, characterized in terms of scale-invariant laws of
modularity. We develop a mathematical framework and demonstrate a
relationship between the modular structure and the evolutionary growth rates of the
interactions, the conserved proteins, and the topological length-scales in the system,
revealing a hierarchy of mutational events giving rise to the modular
topology. These results are expected to apply to other complex networks, providing
a general theoretical framework to describe their modular organization and
dynamics.

Keywords: Complex networks, modularity, protein-protein interactions,
time evolution

It is now well established that systems in biology, from protein-protein
interaction networks to the network of metabolic pathways, self-organize
into modular structures to preserve the overall network function
[1, 2, 3, 4, 5, 6, 7, 8]. We aim to unravel the large-scale organization of the
modular properties of the network in order to develop a mathematical frame- 
work to describe the laws governing its evolution. Our approach is based on 
an empirical study of the protein interaction database of the yeast Saccha- 
romyces cerevisiae 0,0, 1 1 Of ] - Our analysis starts by separating the proteins 
in the network according to their functionality. Functional classes refer to 
groups of proteins that can be associated to a generic process, structure or 
intrinsic function among other classifications. We assign each protein to one
of the annotations of gene functions performed in [10, 11] (see Fig. 1). The
largest classes are translation, transcription, transcription control, protein
fate, cellular organization, genome maintenance, and cellular fate/organization,
while the smaller classes are energy production, amino-acid metabolism,
other metabolism, transport and sensing, and stress and defense.

The inset of Fig. 1 shows the resulting topology according to the above
global classification. Since not all the proteins in one class tend to be
physically associated, this classification does not reveal a clear modular
organization, as the inset of Fig. 1 illustrates. However, a novel level of
organization of the functional classes is revealed when we analyze the clusters
of connected proteins belonging to the same functional class. It is visually
apparent from Fig. 1 that the network separates into well-defined modules
or clusters of proteins within the different functional classes, with a wide
distribution of sizes and no characteristic size. Our representation also
reveals a broad distribution of topological distances between the clusters.
We observe that some clusters are separated by large topological distances
even though they belong to the same functional class (see, for instance, the
clusters of the translation class, light blue in Fig. 1), while others are closely
related (such as the clusters in transcription and transcription control, green
in Fig. 1, as expected). More importantly, we also observe a large degree of
modularity (defined in mathematical terms later), since there are few links
between the clusters and most of the links are concentrated inside the
clusters [12, 13]. Furthermore, an effective repulsion is observed between the
clusters (so-called disassortativity or anticorrelations [14, 15]), since they are
preferentially connected through nodes of lower connectivity and very few
clusters are linked through the most connected nodes (red bonds in Fig. 1).
In what follows, we quantify the above observations using a mathematical
framework and discuss their implications for the system's functionality and
evolution rate.

We measure the number of proteins, or the mass $M_{\rm mass}(\ell)$, in a given
cluster versus the size $\ell$ of the cluster. The size $\ell$ is defined as the maximum
distance between the proteins in the cluster (distance is measured as the
minimum number of links between two proteins). In contrast to the common
view of network modularity, which proposes that the nodes are grouped in
well-defined modules, our results indicate that clustering occurs on all
length-scales. We find that the mass of the clusters scales with the distance as a
power law of the form (see present yeast data in Fig. 2a):

$$M_{\rm mass}(\ell) \sim \ell^{d_c}, \qquad (1)$$

where the scaling exponent of the classes is $d_c = 1.9 \pm 0.1$, and it plays the role
of a topological dimension of the classes (analogous to a topological fractal
dimension [16]). Furthermore, the probability distribution $P(M_{\rm mass})$ to find
a cluster with mass $M_{\rm mass}$ follows a power law in $M_{\rm mass}$,
as seen in Fig. 2b. These scale-invariant laws quantify the large
variability of the clusters and imply that large and small classes follow the
same laws of evolution. They further suggest that the network is critical,
in the sense used in the theory of phase transitions [17].
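
The measurement behind Eq. (1) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used for the analysis: the network is assumed to be given as an adjacency dictionary `adj`, the annotation as a node-to-class dictionary `cls` (both hypothetical inputs), and distances are assumed to be measured along links inside the cluster. The toy data are invented for the example.

```python
from collections import deque
import math

def same_class_clusters(adj, cls):
    """Connected components of the subgraph that keeps only same-class links."""
    seen, clusters = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        comp, queue = [start], deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen and cls[v] == cls[start]:
                    seen.add(v)
                    comp.append(v)
                    queue.append(v)
        clusters.append(comp)
    return clusters

def cluster_size(adj, cluster):
    """Largest shortest-path distance between two members (links inside the cluster only)."""
    members, size = set(cluster), 0
    for source in cluster:
        dist, queue = {source: 0}, deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v in members and v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        size = max(size, max(dist.values()))
    return size

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) versus log(x); an estimate of d_c."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den

# Toy example (hypothetical data): a chain of four "translation" proteins
# attached to a pair of "transport" proteins.
adj = {
    "A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "E"],
    "E": ["D", "F"], "F": ["E"],
}
cls = {"A": "translation", "B": "translation", "C": "translation",
       "D": "translation", "E": "transport", "F": "transport"}

# One (mass, size) pair per cluster.
measurements = [(len(c), cluster_size(adj, c)) for c in same_class_clusters(adj, cls)]
```

On real data one would average the mass over clusters with the same (binned) size, discard single-protein clusters (size zero), and pass the resulting points to `loglog_slope`.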

Next we investigate the modular organization of the network by analyzing
the links inside and between modules. We tile the network with the
minimum number of clusters or modules of proteins containing nodes within
a distance $\ell$ [18]. To capture the degree of modularity of the network we
define the modularity ratio:

$$\mathcal{M}(\ell) \equiv \frac{1}{N_c} \sum_{i=1}^{N_c} \frac{L^i_{\rm in}}{L^i_{\rm out}}, \qquad (2)$$

where $L^i_{\rm in}$ is the number of links between nodes inside the module $i$, $L^i_{\rm out}$ is
the number of links from module $i$ connecting to other modules, and $N_c$ is the
number of modules needed to tile the network. Large values of $\mathcal{M}$ correspond
to a structure where the modules are well separated and therefore to a higher
degree of modularity. Indeed, measures similar to Eq. (2) are extensively
used in the literature to detect modules or communities in complex networks 
ranging from biology to sociology 0, Esl, 20 1 . However, here we find that it 



is more relevant to consider the modularity at different scales of observation
rather than the modularity of the entire network as used in previous studies
where $\ell$ is not considered [19]. Since modules exist on all scales, we expect
that the degree of modularity will display a similar organization. Indeed, we
find that the degree of modularity depends on the scale as:

$$\mathcal{M}(\ell) \sim \ell^{d_M}, \qquad (3)$$

which defines the modularity exponent $d_M = 1.9 \pm 0.1$ (see present yeast
network data in Fig. 2c). The exponent $d_M$ describes the modular organization
in a more universal fashion than the actual value of $\mathcal{M}$
for the entire network as used before 0, 0, 2(|. Therefore, it can be used 



to compare the strength of modularity between dissimilar networks. The
trivial case of a regular lattice in $d$ dimensions gives $\mathcal{M}(\ell) \sim \ell^d/\ell^{d-1} \sim \ell$ and
therefore $d_M = 1$. Modularity exponents larger than 1 indicate a large degree
of modularity. When we randomly rewire the links in the network while preserving
the number of links of each node, we obtain an exponent $d_M \approx 0$. The
main feature of this random uncorrelated network is the clustering of all
the conserved proteins in the core of the network, with the consequent loss
of modularity and functionality. Thus the exponent $d_M$ reveals the level of
correlations in the topology.
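
The lattice case can be checked numerically. The sketch below computes the modularity ratio (the average over modules of links inside a module divided by links leaving it) for a periodic square lattice tiled with square boxes. The square boxes are an assumption made for simplicity; they stand in for, and are not, the tiling procedure used on the protein network. All function names are illustrative.

```python
def modularity_ratio(edges, module_of):
    """Average over modules of (links inside the module) / (links leaving it)."""
    l_in = {m: 0 for m in set(module_of.values())}
    l_out = {m: 0 for m in l_in}
    for u, v in edges:
        mu, mv = module_of[u], module_of[v]
        if mu == mv:
            l_in[mu] += 1
        else:
            # A link between two modules counts as outgoing for both.
            l_out[mu] += 1
            l_out[mv] += 1
    return sum(l_in[m] / l_out[m] for m in l_in) / len(l_in)

def periodic_square_lattice(n):
    """Edges of an n x n square lattice with periodic boundaries."""
    edges = []
    for x in range(n):
        for y in range(n):
            edges.append(((x, y), ((x + 1) % n, y)))
            edges.append(((x, y), (x, (y + 1) % n)))
    return edges

def box_tiling(n, b):
    """Assign every lattice site to a b x b box (b must divide n)."""
    return {(x, y): (x // b, y // b) for x in range(n) for y in range(n)}

n = 24
edges = periodic_square_lattice(n)
ratios = {b: modularity_ratio(edges, box_tiling(n, b)) for b in (2, 3, 4, 6, 8)}
```

A box of side `b` has `2*b*(b-1)` internal links and `4*b` outgoing links, so the ratio is `(b-1)/2`. Since the largest distance inside the box is `2*(b-1)`, the ratio grows linearly with the module size, which is the $d_M = 1$ behavior quoted for regular lattices.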

The similarity between $d_M$ and $d_c$ is also significant. The number of links
inside the modules is proportional to the number of nodes and therefore
$L_{\rm in}(\ell) \sim \ell^{d_c}$. Combining with Eq. (3), we obtain that the number of links
connecting the modules satisfies $L_{\rm out}(\ell) \sim \ell^{d_x}$, where the exponent $d_x = d_c - d_M$.
When $d_M \approx d_c$ the network has attained the maximum degree of modularity
under the constraint imposed by the scaling of the functional classes, Eq.
(1). In this case, $d_x \approx 0$ and $L_{\rm out}(\ell) \sim$ const, implying that the modules are
connected via few links, with most of the links inside the modules. On the
other hand, the lowest degree of modularity corresponds to $d_M = 0$. Since we
find $d_M \approx d_c$ in the yeast, we conclude that this network has attained a high
degree of modularity, as is evident in the plot of Fig. 1.
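
The step leading to the exponent $d_x$ can be written out explicitly. The following is a sketch that assumes the average over modules in the modularity ratio can be replaced by the ratio of typical values $L_{\rm in}(\ell)/L_{\rm out}(\ell)$:

```latex
\mathcal{M}(\ell) \simeq \frac{L_{\rm in}(\ell)}{L_{\rm out}(\ell)}
\quad\Longrightarrow\quad
L_{\rm out}(\ell) \sim \frac{L_{\rm in}(\ell)}{\mathcal{M}(\ell)}
\sim \frac{\ell^{d_c}}{\ell^{d_M}}
= \ell^{\,d_c - d_M} \equiv \ell^{d_x}.
```

With the quoted values $d_c = 1.9 \pm 0.1$ and $d_M = 1.9 \pm 0.1$, the exponent $d_x$ is consistent with zero within the stated uncertainties.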

The biological question of a mathematical description of the dynamical
evolution of the functional classes can now be addressed from the perspective
of what we have learnt about structure and mechanisms of growth. During
the course of the evolution of the species, from the first prokaryotes to the
present-day yeast, some genes have been conserved in all species, while others
have diverged from the ancestral species to become specific to the more
recent ones through mechanisms such as gene duplication, loss,
and de novo creation. Proteins of the present-day yeast genome can
therefore be separated according to the chronology of their appearance in
the domains of life that emerged over evolutionary time [10, 21]. Our
analysis refers to the evolution of conserved proteins, which gives rise to the
observed properties. Thus, we do not consider the loss of proteins during
evolution.



We use the classification of [10] to find the conserved proteins in the
yeast network. The yeast genome is separated into four different classes [10]:
proteins belonging to the present-day yeast only, proteins found in fungi
only, proteins belonging to other eukaryotes only and, finally, the ancestral
prokaryote protein network. Proteins that exist in both yeast and fungi
interaction networks are part of the ancestral protein network, prior to the
divergence of yeast from fungi 300 Myr ago. Analogously, those proteins
that additionally appear in eukaryotes form an even older protein network,
between 500 and 900 Myr ago, when fungi diverged from the rest of the
eukaryotes. Finally, the ancient proteins in present-day yeast are those that
are in common with the oldest form of life, the prokaryotes, which diverged
from the eukaryotes between 1.6 and 2.1 Gyr ago. Since we know the time
of speciation of the yeast from other species, we can define three networks
of conserved proteins as follows: (a) the network of yeast proteins that are
in common with proteins in other fungi (fungi ancestral network with 1045
conserved proteins), dated at $t_1 = -300$ Myr ($-300 \times 10^6$ years; we take
the present time as $t_0 = 0$); (b) the conserved proteins in common with
animals and plants (eukaryote ancestral network with 872 proteins) at
$t_2 = -735 \pm 165$ Myr; and (c) the ancestral prokaryote network with 451 proteins
at $t_3 = -1.85 \pm 0.25$ Gyr.

We know which conserved proteins persist from one
evolutionary time step to the next, and which ones are new to the emergent
species. We describe below a model for the emergence of functional modules
of different sizes and modularity as stated in Eqs. (1) and (3). The process
is illustrated in Fig. 3a by following the evolution (from right to left) of
the conserved protein CDC28 (which belongs to the genome maintenance
class) from the ancestral prokaryote network to becoming a central node
in the present-time subnetwork of yeast with the 12 proteins shown in the
left panel of the figure. At the present time the protein shares links with
CLB5, SIC1, CLB1 and CLN1, among others. These proteins can be clustered
inside a module of size $\ell = 2$, which becomes the conserved node (CDC28)
in the previous time step. The reverse of this coarsening process follows
the time evolution of the network and is consistent with duplication and
divergence of genes [15, 22, 23, 24]. The inheritance of interactions after
duplication suggests that proteins CDC24 and CDC28 may have interacted in
the ancestral eukaryote network, as shown in Fig. 3a. This process can be seen
as the duplication of the two conserved proteins, with the younger proteins
inheriting their interaction and the older proteins losing the interaction. This
mechanism explains the appearance of disassortativity or anticorrelations
(i.e., the tendency of the conserved proteins to be connected preferentially to
younger proteins of lower connectivity [14, 15]), which is relevant to the high
degree of modularity of the network.

The dynamical process can be represented as a tree (analogous to a
dendrogram in studies of community detection in the social sciences [25]), as depicted
in Fig. 3b. Each leaf in the tree represents a protein and the branches
connect proteins that belong to the same module. This procedure identifies
a hierarchy of nested modules defined at different scales. When such
a procedure is applied to the entire interactome of the yeast, we identify
the annotated functional classes, as exemplified in Fig. 3b. Our results have
implications for the design of algorithms for accurate detection of modules and
communities in complex network from biology to sociology 0, @, 0, EH, 20], 
since they could be adapted to incorporate the scaling of the modularity
with the length of observation, Eq. (3), maximizing the modularity ratio at
different length scales. Our method allows us to obtain biologically relevant
information and to predict the functionality of proteins whose function
is still unknown. For example, protein YLR132C, whose function is
unknown, is predicted to belong to the cellular fate functional class, since it
falls deep inside this class in the tree.
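
One coarse-graining step of the kind described above, in which each module is collapsed into the conserved node at its center, can be sketched as follows. This is an illustration only: the links of CDC28 to CLB5, SIC1, CLB1 and CLN1 are taken from the text, while the proteins `P1` and `P2` and all links involving them are hypothetical placeholders.

```python
def coarse_grain(edges, module_of):
    """Collapse each module into a supernode; keep one link per connected pair of modules."""
    super_edges = set()
    for u, v in edges:
        mu, mv = module_of[u], module_of[v]
        if mu != mv:
            super_edges.add((min(mu, mv), max(mu, mv)))
    return sorted(super_edges)

# Present-day links (the last three are hypothetical placeholders).
edges = [
    ("CDC28", "CLB5"), ("CDC28", "SIC1"), ("CDC28", "CLB1"), ("CDC28", "CLN1"),
    ("CLB5", "P1"), ("CDC24", "P1"), ("CDC24", "P2"),
]

# Each protein is assigned to the conserved protein at the center of its module.
module_of = {
    "CDC28": "CDC28", "CLB5": "CDC28", "SIC1": "CDC28",
    "CLB1": "CDC28", "CLN1": "CDC28",
    "CDC24": "CDC24", "P1": "CDC24", "P2": "CDC24",
}

ancestral_edges = coarse_grain(edges, module_of)
```

Applying the same function repeatedly, with a new assignment of supernodes to older conserved proteins at each step, produces the nested hierarchy that the tree represents; read in reverse, each step corresponds to one stage of network growth.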

Next we demonstrate that the modular structure of the network is a
consequence of dynamical processes characterized by specific exponential growth
laws, as opposed to randomness, as well as a conservation law of
modularity. This allows us to relate the scaling exponents of the modular structure
to the growth rate of evolution of the network. The mathematical framework
is analogous to that proposed in [15] to account for the fractal nature of
complex networks, since it is based on the exponential growth laws of the network
topology. Here we show that it explains the scale-invariant modular
organization described above. We consider the distance between conserved proteins
in the yeast network, $\ell(t_0)$, and compare it with the distance between the same
proteins, $\ell(t_\alpha)$, in the previous networks with $\alpha = 1, 2, 3$. As younger
proteins are added to the network, the distances between nodes increase. The
evolution of the length-scales can be modeled by the following form (Fig. 4a):

$$\ell(t_\alpha) = a(t_\alpha|t_0)\, \ell(t_0), \qquad \alpha = 0, 1, 2, 3, \qquad (4)$$

where the generator $a(t_\alpha|t_0)$ is exponential in time (Fig. 4b):

$$a(t_\alpha|t_0) = \frac{\ell(t_\alpha)}{\ell(t_0)} = e^{r_\ell t_\alpha}, \qquad (5)$$

and the rate of growth of the distances is $r_\ell = 0.3$/Gyr.

The conservation of modularity under time evolution is the key to
understanding the emergence of the modular organization stated in Eq. (3). In
Fig. 3a, we demonstrated that the younger proteins are usually clustered
around the conserved proteins, which suggests a natural identification of
modules according to the different conserved proteins. Similarly to Eq. (3), we
calculate the modularity ratio $\mathcal{M}(t_\alpha)$ from the connectivity in the present
yeast network by clustering the modules around the conserved proteins of
age $t_\alpha$. We obtain:

$$\mathcal{M}(t_\alpha) = \frac{1}{N(t_\alpha)} \sum_{i=1}^{N(t_\alpha)} \frac{L^i_{\rm in}(t_\alpha)}{L^i_{\rm out}(t_\alpha)}, \qquad (6)$$

where $L^i_{\rm in}(t_\alpha)$ and $L^i_{\rm out}(t_\alpha)$ are the numbers of links inside module $i$
and from module $i$ to other modules, for conserved proteins of age $t_\alpha$. The
scaling law (3) arises when we combine $\mathcal{M}(t_\alpha)$ with $\ell(t_\alpha)/\ell(t_0)$. We obtain
$\mathcal{M}(t_\alpha) = (\ell(t_\alpha)/\ell(t_0))^{-d_M}$, or

$$\mathcal{M}(t_\alpha) = a(t_\alpha|t_0)^{-d_M}, \qquad (7)$$

where $d_M = 1.9$. This relation is confirmed by an independent measurement
of $d_M$ from Fig. 2c, which is used to fit the data in Fig. 4c. The confirmation
of the scaling in Figs. 2c and 4c implies that the conserved proteins are
preferentially contained within a separate class defined by a given length
scale. The proposed mechanism is further confirmed by the prediction
that Eqs. (1) and (3) are stable over time, as shown in Figs. 2a and 2c,
respectively.
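
For orientation, the factor $a(t_\alpha|t_0)^{-d_M}$ of Eq. (7) can be evaluated for the three ancestral networks by inserting the exponential generator of Eq. (5). This is a back-of-the-envelope sketch that uses only the central values quoted in the text and ignores the uncertainties on the divergence times; the variable names are illustrative.

```python
import math

r_ell = 0.3   # growth rate of the distances, in 1/Gyr (quoted in the text)
d_M = 1.9     # modularity exponent (quoted in the text)

# Central values of the divergence times in Gyr; the present is t_0 = 0.
ages = {"fungi": -0.300, "eukaryote": -0.735, "prokaryote": -1.85}

def generator(t, rate=r_ell):
    """Length-scale generator a(t|t_0) = exp(rate * t), with t_0 = 0."""
    return math.exp(rate * t)

# Factor a(t|t_0)^(-d_M) appearing on the right-hand side of Eq. (7).
scaling_factor = {name: generator(t) ** (-d_M) for name, t in ages.items()}
```

The factor grows with the age of the conserved proteins: modules built around older proteins are larger and, according to Eq. (7), have a larger modularity ratio.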

Furthermore, we empirically find an exponential growth in the number of
conserved proteins as a function of time:

$$N(t_\alpha) = n(t_\alpha|t_0)\, N(t_0), \qquad \alpha = 1, 2, 3, \qquad (8)$$

where $N(t_\alpha)$ is the number of conserved proteins at time $t_\alpha$, and (see Fig. 4d):

$$n(t_\alpha|t_0) = \frac{N(t_\alpha)}{N(t_0)} = e^{r_N t_\alpha}, \qquad (9)$$

with a growth rate of conserved proteins $r_N = 0.56$/Gyr.
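
The growth rate can be re-estimated from the protein counts quoted earlier. The sketch below is not the fitting procedure used for the figure: it assumes that $N(t_0)$ is the 1293 proteins of the present-day database (the number given in the caption of Fig. 1), uses the central values of the divergence times, and fits a straight line through the origin to $\ln[N(t_\alpha)/N(t_0)]$ versus $t_\alpha$.

```python
import math

# Central divergence times in Gyr (t_0 = 0 is the present) and the numbers of
# conserved proteins quoted in the text.
ages = [-0.300, -0.735, -1.85]
counts = [1045, 872, 451]

# Assumption: the present-day count is the 1293 proteins of the database.
n_present = 1293

def growth_rate(ages, counts, n_present):
    """Least-squares slope through the origin of ln(N(t)/N(t_0)) versus t."""
    logs = [math.log(n / n_present) for n in counts]
    return sum(t * y for t, y in zip(ages, logs)) / sum(t * t for t in ages)

r_N = growth_rate(ages, counts, n_present)
print(f"estimated r_N = {r_N:.2f} per Gyr")
```

Under these assumptions the estimate lands close to the quoted value of 0.56/Gyr. With only three points and sizeable uncertainties on the ages, it should be read as a consistency check rather than an independent measurement.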

The scale-invariant organization of Eq. (1) can be explained by the
exponential growths of Eqs. (5) and (9). By combining both equations we obtain
a power-law relation between the distances and the number of conserved
proteins, with exponent given by the ratio of the growth rates,

$$N(\ell) \sim \ell^{r_N/r_\ell}, \qquad (10)$$

or equivalently

$$n(t_\alpha|t_0) = a(t_\alpha|t_0)^{r_N/r_\ell}. \qquad (11)$$

We find that $r_N/r_\ell = 1.9$, as confirmed in Fig. 4e. The ratio of rates agrees with
the topological exponent from Eq. (1):

$$\frac{r_N}{r_\ell} = d_c = 1.9. \qquad (12)$$

This result establishes a direct connection between dynamical ($r_N$, $r_\ell$) and
static ($d_c$) properties. It shows how the evolution rate of
the distances between conserved proteins determines the present-day modular
organization of the functional classes.
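
The power law follows from eliminating time between the two exponential laws. A short derivation, writing $a \equiv a(t_\alpha|t_0)$ and $n \equiv n(t_\alpha|t_0)$ for brevity:

```latex
a = e^{r_\ell t_\alpha}
\;\Longrightarrow\;
t_\alpha = \frac{\ln a}{r_\ell},
\qquad
n = e^{r_N t_\alpha}
  = \exp\!\left(\frac{r_N}{r_\ell}\,\ln a\right)
  = a^{\,r_N/r_\ell},
\qquad\text{so that}\qquad
\frac{N(t_\alpha)}{N(t_0)} = \left(\frac{\ell(t_\alpha)}{\ell(t_0)}\right)^{r_N/r_\ell}.
```

With the quoted rates, $r_N/r_\ell = 0.56/0.3 \approx 1.87$, which agrees with $d_c = 1.9 \pm 0.1$.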

Equations (J3J), (JTJ) and (jSJ are the backbone of the laws of network mod- 
ularity and are summarized in Fig. [5] showing the equivalence between the 
static exponents and the growth rates. Our results indicate that the
network is evolving by preferentially connecting the functional classes via
low-connectivity nodes (as exemplified by the very few red bonds in Fig. 1).
Consequently, the conserved proteins are dispersed in the network, providing
the functional divergence and a level of insulation of the classes [14, 15].

The theoretical framework is complemented with a multiplicative law for
the number of links. The degree distribution $P(k)$ to find a node with $k$
links displays a broad character of the form $P(k) \sim k^{-\gamma}$ [27], where the
exponent $\gamma = 2.2$ is the same for the networks of conserved proteins (Fig.
6a). Our analysis shows that the power-law degree distribution arises from
the combination of two multiplicative processes in Eqs. (8) and (13) below.

We consider the number of interactions $k(t_0)$ of each conserved protein in
the present-day yeast organism, and compare this quantity with the degree
$k(t_\alpha)$ of the same protein in the ancient networks at time $t_\alpha < t_0$. We find
(Fig. 6b) that the number of interactions also follows a linear multiplicative
growth:

$$k(t_\alpha) = s(t_\alpha|t_0)\, k(t_0), \qquad \alpha = 0, 1, 2, 3, \qquad (13)$$

with

$$s(t_\alpha|t_0) = e^{r_k t_\alpha}, \qquad (14)$$

decreasing for the earlier protein networks. The growth rate is $r_k = 0.46$/Gyr.
Equations (8) and (13) give rise to the broad distribution of connectivity,
while Eqs. (4) and (13) describe how the degree of the conserved proteins
scales with distance through the connectivity exponent $d_k$ (see below).

We define $N(k, t)$ as the number of nodes with degree $k$ at time $t$. Then
we have

$$N(k, t) = N(t)\, P(k), \qquad (15)$$

where $P(k)$ is the degree distribution, which is the same at all times. The
density conservation law then gives:

$$N(k(t_\alpha), t_\alpha)\, dk(t_\alpha) = N(k(t_0), t_0)\, dk(t_0). \qquad (16)$$

From this equation and Eqs. (13) and (15) we obtain:

$$N(t_\alpha)\, P(s\, k(t_0))\, s\, dk(t_0) = N(t_0)\, P(k(t_0))\, dk(t_0), \qquad (17)$$

from which we find that $P(sk)\, dk \propto P(k)\, dk$, where the proportionality
factor $N(t_0)/(s\, N(t_\alpha))$ does not depend on $k$. The only probability
distribution satisfying this law is a power law. Therefore the degree
distribution must be of the form $P(k) \sim k^{-\gamma}$. Substituting the power-law
degree distribution back into Eq. (17), we obtain

$$N(t_\alpha) = s^{\gamma - 1} N(t_0), \qquad (18)$$

or equivalently,

$$\gamma = 1 + \frac{\ln n(t_\alpha|t_0)}{\ln s(t_\alpha|t_0)} = 1 + \frac{r_N}{r_k}. \qquad (19)$$

We plot the obtained $n(t_\alpha|t_0)$ versus $s(t_\alpha|t_0)$ in Fig. 6c and fit the data with
an independent measurement of $\gamma$ from $P(k)$. Despite the short range of the
data set, the scaling theory is consistent with the empirical measurement.
The significance of this result is that it relates the growth rates, through
$\ln n(t_\alpha|t_0)/\ln s(t_\alpha|t_0)$, to static properties such as the exponent $\gamma$.
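
The substitution leading from Eq. (17) to Eq. (19) can be spelled out as follows. This is a sketch in which $C$ is an assumed normalization constant of the degree distribution, and $k$, $s$ and $n$ abbreviate $k(t_0)$, $s(t_\alpha|t_0)$ and $n(t_\alpha|t_0)$:

```latex
P(k) = C\,k^{-\gamma}
\;\Longrightarrow\;
N(t_\alpha)\, C\, (s k)^{-\gamma}\, s\, dk = N(t_0)\, C\, k^{-\gamma}\, dk
\;\Longrightarrow\;
N(t_\alpha)\, s^{1-\gamma} = N(t_0),

n = \frac{N(t_\alpha)}{N(t_0)} = s^{\gamma - 1}
\;\Longrightarrow\;
\gamma - 1 = \frac{\ln n}{\ln s}
           = \frac{r_N t_\alpha}{r_k t_\alpha}
           = \frac{r_N}{r_k}.
```

The last step uses the exponential forms $n = e^{r_N t_\alpha}$ and $s = e^{r_k t_\alpha}$, so the time dependence cancels and $\gamma$ is fixed by the two growth rates alone.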
Combining Eqs. (4) and (13) we obtain

$$k(\ell) \sim \ell^{-d_k}, \qquad (20)$$

which defines the dependence of the degree on the scale of observation. We
measure the exponent $d_k = 1.5$ from the static measurements, which is given
by $d_k = r_k/r_\ell$, showing how the rate of evolution determines the present
structure of the connectivity.
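
The relations between growth rates and static exponents can be checked directly against the numbers quoted in the text. The snippet below is only an arithmetic consistency check; the dictionary names are illustrative.

```python
# Growth rates quoted in the text, in units of 1/Gyr.
r_ell, r_N, r_k = 0.3, 0.56, 0.46

# Static exponents quoted in the text.
measured = {"d_c": 1.9, "d_k": 1.5, "gamma": 2.2}

# Exponents predicted from the growth rates.
predicted = {
    "d_c": r_N / r_ell,       # Eq. (12)
    "d_k": r_k / r_ell,       # relation stated below Eq. (20)
    "gamma": 1 + r_N / r_k,   # Eq. (19)
}

for name in measured:
    print(f"{name}: predicted {predicted[name]:.2f}, measured {measured[name]}")
```

Each predicted exponent differs from the measured one by less than 0.1, which is the size of the uncertainty quoted for $d_c$ and $d_M$.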

The dynamical laws proposed in this study could be placed in the context 
of driving forces in evolution and principles governing it, with implications 
for network robustness. The failure or malfunction of a single module by 
deletion of a few highly connected nodes would not greatly affect the global 
stability of the network due to the tenuous connectivity between the mod- 
ules lij]. Networks that only follow random uncorrelated growth (like the 



preferential attachment rule leading to scale-free networks [26]) are
characterized by a central core of highly connected proteins (we find that they
have $d_M \approx 0$). Such an organization violates the large-scale modularity of
the network, rendering the scale-free networks non-functional. On the
contrary, here we find that evolution-constrained networks have evolved
following stable scaling laws for modularity. This particular architecture isolates
the conserved proteins from one another, increasing the robustness of the
network. It is then possible to conjecture that the scale-invariant modular
structure described in this work has been shaped by natural selection.

Acknowledgements. This work is supported by a National Science
Foundation grant, NSF-EF.

FIG. 1. Topological structure and modularity in the protein interaction
network of the yeast, showing clusters of proteins in different functionality
classes. The database consists of 2493 high-confidence interactions between 
1293 proteins 0, 0, [To| • Each supernode in this network represents a cluster 
where the size is proportional to the mass of the cluster according to Eq. (1).
The clusters are colored according to their functional classes. It is visually
apparent that our clustering analysis reveals a wide size distribution. There
is a tenuous connectivity of the clusters, as implied by a large modularity
ratio, Eq. (2). The red bonds correspond to interlinks between the most
connected proteins in each module. The full interactome of the yeast without
clustering analysis, shown in the inset, does not convey clear information about
modular structure.

FIG. 2. Scaling laws of cluster mass and modularity. (a) Log-log plot
of the mass of the clusters of proteins in the functional classes versus size
according to Eq. (1) for the different networks. Each point is an average
over many clusters in the network with the same (binned) $\ell$. We plot the
average mass for each $\ell$. (b) Probability distribution of the mass of the
clusters in the functional classes, $P(M_{\rm mass})$, showing a power-law
distribution. (c) Log-log plot of the modularity ratio versus size of
the modules for different networks according to Eq. (3).

FIG. 3. Emergence of the modular structure and functional classes in the
yeast proteome. (a) An example of the generation of the tree for the
evolution of protein CDC28 (which belongs to the genome maintenance functional
class; see the shaded rectangle in Fig. 3c for the exact location of this subtree)
from the ancestral prokaryote network to the yeast network. The proteins
are coloured according to their age (from red to green, see timeline). The
four yeast proteins in green are clustered around CDC28, forming a module.
Three modules are created, centered on the nodes CDC24, CDC28 and CKS1,
from the fungi network to the eukaryote network. Finally, all the eukaryote nodes
form a module which is coarse-grained into the CDC28 node in the prokaryote
network. The time evolution of the network is the reverse of this process. (b)
The generation of the tree. The colors of the branches
of the tree represent different clusters. (c) Emergence of the functional classes
in the yeast proteome through the application of the procedure explained in
Fig. 3a,b. Here, time goes from the top of the tree to the bottom. The
different colors of the tree correspond to different functional classes using the
color code of Fig. 1.

FIG. 4. (a) Multiplicative law of the topological distances between
conserved proteins for different times according to Eq. (4). Each point is an
average over many pairs of nodes in the network with the same (binned) $\ell(t_0)$.
(b) Exponential growth with time of the topological distance between
conserved proteins, $\ell(t_\alpha)/\ell(t_0)$. (c) Log-log plot of $\mathcal{M}(t_\alpha)$ versus the length-scales
$a(t_\alpha|t_0)$ according to Eq. (7). Even though we cannot fit the data due to the
small number of data points, we show that an independent measurement of
$d_M$ from Fig. 2c provides a fit to the data, confirming Eq. (7). (d) Exponential
growth with time of the number of conserved proteins $N(t_\alpha)$. (e) Log-log
plot of the number of conserved proteins versus the distances according to
Eq. (11). The same considerations as in Fig. 4c apply here. We do not attempt
to fit these data due to the limited number of points (note that each point
corresponds to a network of ancient conserved proteins). Instead we show
the equivalence $d_c = r_N/r_\ell$ by plotting a line with slope $d_c$ through the data.
The value of $d_c$ is obtained from an independent estimation from Fig. 2a.

FIG. 5. Summary of the results: conservative and multiplicative laws
determine the scaling exponents ($d_c$, $d_M$, $d_k$, $\gamma$) in terms of growth rates ($r_\ell$, $r_N$, $r_k$).

FIG. 6. Scaling laws for the network connectivity. (a) The distribution
$P(k + k_0) \sim (k + k_0)^{-\gamma}$ with $\gamma = 2.2$ is the same for the present network and
the network of conserved proteins. Here we use a small cut-off $k_0 = 0.6$; see
[27]. (b) We compare the number of links of nodes in the ancestral networks,
$k(t_\alpha)$, where $\alpha = 1, 2, 3$ for the ancestral fungi, eukaryote and prokaryote
networks, respectively, with the number of links of the same protein in the
present-time yeast network, $k(t_0)$. Each point is an average over many
proteins in the network with the same (binned) $k(t_0)$. Here we add a small
cut-off to the degree, $k_0$, which according to our results is $k_0 = 0.6$. (c)
Scaling of $n(t_\alpha|t_0) \sim s(t_\alpha|t_0)^{\gamma-1}$. Due to the limited number of data points
we do not attempt to directly fit the data. The straight solid line is obtained
from an independent measure of $\gamma$ from Fig. 6a, showing that relation (18)
is satisfied.